A Proactive Fault Tolerance Scheme for Large Scale Storage Systems
Authors
Abstract
As drive failure rates in data centers rise, reactive fault tolerance mechanisms alone can hardly guarantee high reliability. Several hard drive failure prediction models have therefore been proposed to identify soon-to-fail drives in advance, but few researchers have applied these models to distributed systems to improve reliability. This paper proposes SSM (Self-Scheduling Migration), which monitors drives’ health status and uses the results of the prediction models to migrate data from soon-to-fail drives to other drives in advance. SSM integrates a self-scheduling migration algorithm into the distributed system. The algorithm dynamically adjusts migration rates according to each drive’s severity level, which is derived from real-time prediction results. When selecting migration source and destination drives, it also makes full use of available resources and balances the load. Migration bandwidth is allocated so as to minimize the side effects of migration on system services. We implement a prototype based on the Sheepdog distributed system. Migration causes performance drops of only 8% on read operations and 13% on write operations. Compared with reactive fault tolerance, SSM significantly improves system reliability and availability.
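As a rough illustration only, the severity-driven rate adjustment and load-aware destination selection described in the abstract could be sketched as follows. The abstract does not give the actual SSM formulas, so the proportional split, the integer severity scale, the `load` field, and every name below are assumptions made for this example, not the authors' algorithm.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    severity: int  # assumed scale: 0 = healthy, larger = closer to predicted failure
    load: float    # assumed current utilization in [0, 1]


def allocate_migration_bandwidth(drives, budget_mbps):
    """Split a fixed migration budget across at-risk drives in proportion to severity.

    The proportional rule is an assumption for illustration; the budget cap stands in
    for the abstract's goal of limiting the impact of migration on system services.
    """
    at_risk = [d for d in drives if d.severity > 0]
    total = sum(d.severity for d in at_risk)
    if total == 0:
        return {}
    return {d.name: budget_mbps * d.severity / total for d in at_risk}


def pick_destination(drives):
    """Pick the least-loaded healthy drive as the migration target (assumed policy)."""
    healthy = [d for d in drives if d.severity == 0]
    if not healthy:
        raise ValueError("no healthy drive available as a migration destination")
    return min(healthy, key=lambda d: d.load)


drives = [
    Drive("a", severity=3, load=0.5),
    Drive("b", severity=1, load=0.2),
    Drive("c", severity=0, load=0.7),
    Drive("d", severity=0, load=0.1),
]
rates = allocate_migration_bandwidth(drives, budget_mbps=100)
target = pick_destination(drives)
```

Under these assumptions, re-running the allocation whenever the prediction model updates a drive's severity would shift bandwidth toward the drives judged closest to failure while keeping total migration traffic within the budget.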
Similar resources
Design of Fault-Tolerant Large-Scale VOD Servers: With Emphasis on High-Performance and Low-Cost
Recent technological advances in digital signal processing, data compression techniques, and high-speed communication networks have made Video-on-Demand (VOD) servers feasible. A challenging task in such systems is servicing multiple clients simultaneously while satisfying real-time requirements of continuous delivery of objects at specified rates. To accomplish these tasks and realize economi...
Proactive Service Migration for Long-Running Byzantine Fault Tolerant Systems
In this paper, we describe a novel proactive recovery scheme based on service migration for long-running Byzantine fault tolerant systems. Proactive recovery is an essential method for ensuring long term reliability of fault tolerant systems that are under continuous threats from malicious adversaries. The primary benefit of our proactive recovery scheme is a reduced vulnerability window. This ...
Failure prediction for HPC systems and applications: Current situation and open issues
As large-scale systems evolve towards post-petascale computing, it is crucial to focus on providing fault-tolerance strategies that aim to minimize fault’s effects on applications. By far the most popular technique is the checkpoint–restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and proactive measures are taken. Th...
A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud
As high-performance computing (HPC) systems continue to increase in scale, their mean-time to interrupt decreases respectively. The current state of practice for fault tolerance (FT) is checkpoint/restart. However, with increasing error rates, increasing aggregate memory and not proportionally increasing I/O capabilities, it is becoming less efficient. Proactive FT avoids experiencing failures ...
Towards a Secure Fragment Allocation of Files in Heterogeneous Distributed Systems
There is a growing demand for large-scale distributed storage systems to support resource sharing and fault tolerance. Although heterogeneity issues of distributed systems have been widely investigated, little attention has been given to security solutions designed for distributed storage systems with heterogeneous vulnerabilities. To address this issue, we design a Secure Fragment Allocation S...